| Project_ID | Project_Name | Project_Development_Objective |
|---|---|---|
| P127665 | Second Economic Recovery Development Policy Loan | This development policy loan supports the Government of Croatia's reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate. |
| P179010 | Tunisia Emergency Food Security Response Project | To (a) ensure, in the short-term, the supply of (i) agricultural inputs for farmers to secure the next cropping seasons and for continued dairy production, and (ii) wheat for uninterrupted access to bread and other grain products for poor and vulnerable households; and (b) strengthen Tunisia’s resilience to food crises by laying the ground for reforms of the grain value chain. |
A blog-form recap of the first exploratory stage of this learning project
Motivation
I have always been fascinated by the idea of analyzing language as data, and I finally found some time to study Natural Language Processing (NLP) and Text Analytics techniques.
In this project, I explore a dataset of World Bank Projects & Operations, with a focus on the text data contained in the Project Development Objective (PDO) section of the Bank’s projects. The PDO outlines, in concise form, the proposed objectives of the project, defined in the early stages of the World Bank project cycle.
Table 1 shows a couple of examples. Normally, a few objectives are listed in paragraphs that are a couple of sentences long.
The dataset also includes relevant metadata about the projects, including country, fiscal year of approval, project status, main sector, main theme, environmental risk category, and lending instrument.
Overview of available data
The original dataset contained 22,569 WB projects approved between FY1947 and FY2026 as of August 31, 2024. Of these, approximately 50% (11,322 projects) had a “viable” PDO text in the dataset (i.e., not blank or labeled as “TBD”, etc.), all approved after FY2001. From this subgroup, additional projects were excluded for incompleteness: 3 projects lacking project status, 2,176 projects without “board approval FY”, and 332 projects with approval still pending as of September 2024.
Thus, the usable data for the analysis consists of 8,811 projects.
Surprisingly, among these cleaned projects, 2,235 unique projects share just 1,006 unique PDOs: such "recycled" PDOs seem to occur in cases of follow-up projects or components of a parent project.
Finally, from the remaining pool of 8,811 projects, I focused on a representative training sample of 4,403 projects.
First, it is important to note that all 7,548 projects approved before FY2001 had no PDO text available.
The exploratory analysis of the 11,353 projects WITH PDO text revealed some interesting findings:
- PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
- PDO text missingness: besides the 11,306 projects with missing PDOs, 31 projects had invalid PDO values, namely:
- 11 have PDO as one of: “.”, “-”, “NA”, “N/A”
- 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”, “PDO remains the same.”
- 9 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
- 4 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “a”
Of the remaining 11,322 projects with a valid PDO, some more projects were excluded from the analysis for incompleteness:
- 3 projects without “project status”
- 2,176 projects without “board approval FY”
- 332 projects with board approval in FY2024 or later (approval stage still incomplete)
Lastly (and this was quite surprising to me), the remaining 8,811 viable, unique projects were matched by only 7,582 unique PDOs! In fact, 2,235 projects share 1,006 non-unique PDO texts in the “cleaned” dataset. Why? Apparently, the same PDO is re-used for multiple projects (from 2 to as many as 9 times), likely in cases of follow-up phases of a parent project or components of the same lending program.
In sum, the cleaning process yielded a usable set of 8,811 projects, which was split into a training subset (4,403) to explore and test models and a testing subset (4,408), held out for post-prediction evaluation.
Preprocessing the PDO text data
Cleaning text data entails extra steps compared to numerical data. A key process is tokenization, which breaks text into smaller units like words, bigrams, n-grams, or sentences. Another common cleaning task is normalization, where text is standardized (e.g., converting to lowercase). Similarly, data reduction techniques like stemming and lemmatization simplify words to their root form (e.g., “running,” “ran,” and “runs” become “run”). This can help to reduce dimensionality, especially with very large datasets, when the word form is not relevant.
After tokenization, it is very common to remove irrelevant elements like punctuation or stop words (unimportant words like “the”, “i)”, “at”, or obvious ones in context like “pdo”) which add noise to the data.
In contrast, data enhancement techniques like part-of-speech tagging add value by identifying grammatical components, allowing focus on meaningful elements like nouns, verbs, or adjectives.
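The preprocessing steps above can be sketched in a few lines. The resources at the end of this post are R-oriented, but the logic is language-agnostic; this Python sketch uses a deliberately tiny stop-word list and a crude suffix-stripping "stemmer" as stand-ins for real tools such as the Porter stemmer:

```python
import re

# Illustrative only: a tiny stop-word list; real analyses use fuller
# lists (e.g. those shipped with NLTK, spaCy, or tidytext).
STOP_WORDS = {"the", "to", "of", "and", "for", "in", "a", "pdo"}

def tokenize(text):
    """Normalize to lowercase and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ment", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("To strengthen institutions and improve the delivery of basic services."))
# ['strengthen', 'institution', 'improve', 'delivery', 'basic', 'service']
```

Note how normalization, stop-word removal, and stemming each shrink the vocabulary the downstream analysis has to deal with.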
Term Frequency
Figure 1 shows the most recurrent tokens and stems in the PDO text data. Evidently, after stemming, more words (or rather, stems) reach the frequency threshold of 800, since word variants have been combined by root. Despite the pre-processing of the PDO text data, these are not particularly informative words.
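To make the stemming effect concrete, here is a toy Python illustration (the corpus and the crude stemmer are invented for the example) of why stem counts reach frequency thresholds that no single raw token does, once variants collapse onto a common root:

```python
from collections import Counter

# Invented PDO-like snippets, just to show the counting mechanics.
pdos = [
    "to improve access to basic services",
    "improving access to financial services",
    "improved service delivery for the poor",
]

def naive_stem(token):
    """Crude illustration only: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [t for pdo in pdos for t in pdo.split()]
token_counts = Counter(tokens)
stem_counts = Counter(naive_stem(t) for t in tokens)

# "improve", "improving", "improved" each occur once as raw tokens,
# but all three collapse onto the stem "improv".
print(token_counts["improve"], stem_counts["improv"])  # 1 3
```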
Interesting patterns in the PDO text data
Metadata quality enhancement with ML predictive models
The idea is to predict the “missing” metadata fields (sector, environmental risk category, etc.) in the World Bank project records, using the text of the Project Development Objective (PDO) section as input data.
Steps of prediction
- Label engineering: define what we want to predict (the outcome variable, \(y\)) and its functional form (binary or multiclass; log form or not, if numeric).
  - Deal with missing values in \(y\) (understand whether there are systematic reasons for missingness and, if so, how to address them).
  - Deal with extreme values of \(y\) (conservatively is best).
- Sample design: select the observations to use. For high external validity, the sample must be as close as possible to the population of interest (distribution patterns of the variables, etc.).
- Feature engineering: define the input data (the predictors, \(X\)) and their format (text, numeric, categorical).
  - Deal with missing values in \(X\) (understand, variable by variable, the reasons for missingness, and decide what to do: keep, impute a value if numeric, or drop the predictor).
  - Select the most relevant predictors (which \(X\) to include and in which form). For text predictors, specific NLP transformations can be applied (e.g., tokenization, lemmatization).
  - In some cases, interactions between predictor variables make sense.
  - Alternative models can be built with fewer predictors in simpler form, to compare against models with more predictors in more complex form. Here, domain knowledge and EDA are key to deciding what to include and what to exclude.
- Model selection: it is impossible to try all possible models (all possible choices of \(X\) variables, their functional forms, and their interactions yield too many combinations).
  - Cross-validation is similar to the train-test method, which splits the data into training and test sets, but it does so multiple times (e.g., k-fold cross-validation with \(k = 10\) gives 10 test sets) and helps select the best model without overfitting. All of the work described above (model building and best-model selection) happens in the work set, which is further divided \(k\) times into \(k\) train-test splits; the holdout set is then used to evaluate the prediction itself.
- `last_fit` means that, once the best model(s) are selected, they are re-run on the entire work set (training data) to evaluate performance and obtain the final model.
- Post-prediction diagnostics, lastly, serve to evaluate the model’s performance on the holdout sample instead. Here we can:
  - evaluate the fit of the prediction (using MSE, RMSE, accuracy, ROC, etc. to summarize goodness of fit);
  - visualize the prediction interval around the prediction (for continuous \(y\)) or the confusion matrix (for discrete \(y\));
  - zoom in on the kinds of observations we care about the most, or look at the fit in certain sub-samples of the data (e.g., by sector, by year);
  - finally, assess external validity (the holdout set helps, but is not representative of all the “live data”).
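As a sketch of the index bookkeeping behind k-fold cross-validation (the fold count, sample size, and seed below are arbitrary; a framework like tidymodels or scikit-learn would normally handle this):

```python
import random

def k_fold_indices(n_obs, k=10, seed=42):
    """Split indices 0..n_obs-1 into k folds; each fold serves once as
    the test set while the remaining k-1 folds form the training set."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

splits = list(k_fold_indices(100, k=10))
print(len(splits))            # 10 train/test splits
train, test = splits[0]
print(len(train), len(test))  # 90 10
```

Each candidate model is fit on every training portion and scored on the corresponding test portion, and the average score across the \(k\) splits is what guides model selection; the holdout set is never touched until the very end.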
LASSO (Least Absolute Shrinkage and Selection Operator) is a sort of “add-on” to linear regression models which, by adding a penalty term, gets better predictions from regressions with many predictors: it selects a subset of the predictor variables, which helps to avoid overfitting. The output of the LASSO algorithm is the set of coefficient values for the predictors that are kept in the model. In the formula, \(\lambda\) is the tuning parameter, which can be tuned to get the best model.
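The formula referenced in this paragraph does not appear in the text; for reference, the standard LASSO objective adds an \(\ell_1\) penalty to the least-squares loss:

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
\]

As \(\lambda\) grows, more coefficients are shrunk exactly to zero and the corresponding predictors drop out of the model; \(\lambda = 0\) recovers ordinary least squares.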
Parting thoughts and next steps
Evidently, this project was primarily a learning / proof-of-concept exercise, so I wasn’t concerned with in-depth analysis of the data, nor with maximizing ML models’ predictive performance. Nevertheless, this initial exploration demonstrated the potential of applying NLP techniques to unstructured text data to uncover valuable insights, such as:
- detecting frequency trends of sector-specific language and topics over time;
- improving document classification and metadata tagging via ML predictive models;
- uncovering surprising patterns and relationships in the data, e.g., recurring phrases or topics;
- triggering additional text-related questions that could lead to further research.
Next steps could include:
- delving deeper into hypothetical explanations for the patterns observed, e.g. by combining NLP on this document corpus with other data sources (e.g. information on other WB official documents and policy statements);
- exploring more advanced NLP techniques, such as Named Entity Recognition (NER), Structural Topic Modeling (STM), or BERTopic, to enhance the analysis and insights drawn from World Bank project documents.
A pain point in this type of work is efficiently retrieving input data from document corpora. Despite the World Bank’s generous “Access to Information” policy, programmatic access to its extensive text data resources is still quite hard (no dedicated API, various stale pages and broken links). This should be addressed, perhaps following the model of the World Development Indicators (WDI) data, which are much more accessible and well curated.
Amid the ongoing hype around AI and Large Language Models (LLMs), this kind of analysis seems like yesterday’s news. However, I believe there is still a huge untapped potential for meaningful applications of NLP and text analytics in development studies, policy analysis, and other areas—which will be even more impactful if informed by domain knowledge.
Acknowledgements
Below are some valuable resources for learning and implementing NLP techniques, especially geared toward R programming.